A Model of Asynchronous Iterative Algorithms for 
Solving Large, Sparse, Linear Systems 
Department of Computer Science 
ABSTRACT 


Solving large, sparse, linear systems of equations is one of the fundamen- 
tal problems in large scale scientific and engineering computation. A 
model of a general class of asynchronous, iterative solution methods for 
linear systems is developed. In the model, the system is solved by creat- 
ing several cooperating tasks that each compute a portion of the solution 
vector. This model is then analyzed to determine the expected intertask 
data transfer and task computational complexity as functions of the 
number of tasks. Based on the analysis, recommendations for task parti- 
tioning are made. These recommendations are a function of the sparse- 
ness of the linear system, its structure (i.e., randomly sparse or banded), 
and dimension. 

The research reported here was supported in part by the National 
Aeronautics and Space Administration under NASA Contracts No. 
NAS1-17070 and No. NAS1-17130 and was performed while the authors 
were visitors at ICASE, NASA Langley Research Center, Hampton, VA 
23665. In addition, the second author was also supported by Control 
Data Grant No. 80D05. 
Introduction 

In this paper we focus on iterative methods for solving large, sparse linear systems 
on MIMD computers. In related work, Adams [1] has studied parallel implementations 
of iterative methods for the linear systems arising from finite element analysis, with
particular emphasis on mapping the methods onto the FEM [3], an MIMD machine being 
developed at the NASA Langley Research Center. She has also developed models for 
predicting the performance of these algorithms and validated them using the FEM [2]. 
Gannon and Van Rosendale [6] have also recently proposed a parallel architecture for 
another class of iterative algorithms based on multigrid methods. Finally, Amano, 
Yoshida, and Aiso [5] have proposed a parallel architecture, called the Sparse Matrix 
Solving Machine, $(SM)^2$, for iteratively solving sparse linear systems. 

In the remainder of the paper, we precisely define both the problem and the class of 
iterative methods used to solve it, and we discuss one possible implementation. We then 
define a probabilistic model for predicting iteration time and an optimal number of data 
partitions given the dimension and sparsity of the coefficient matrix and the costs of 
computation, synchronization, and communication. We conclude with graphs and ana- 
lyses of execution time as a function of the number of matrix partitions for various 
parameter values. 

Problem Definition 

Consider a linear system of equations of the form 

    $Kx = f$    (1)

where $K$ is a large $N \times N$ sparse matrix and $x$ and $f$ are vectors of length $N$. Such
systems are frequently rewritten in the form

    $x = Ax + c$

and solved using the iteration formula

    $x^{(i+1)} = A x^{(i)} + c$    (2)

where $x$ and $c$ are $N$-vectors and $A$ is another sparse $N \times N$ matrix. Although $A$ is a
function of $K$, and $c$ is a function of $K$ and the vector $f$, there are many ways to choose
$A$ and $c$ such that (2) describes a convergent iterative scheme for (1). We only assume
that they are chosen such that the sequence of iterates $\langle x^{(i)} \rangle$ converges to the solution.
Henceforth, we consider only the parallel implementation of the computation defined by
(2).

Parallel Solution Technique 

One parallel computation schema for (2) is illustrated by the diagram in Figure I.
The matrix A and the vectors c and $x^{(i+1)}$ (denoted by XN) are partitioned into sets of
rows. A basic iteration step of the computation is then partitioned into the set of
computations defined symbolically by

    XSET[I] = ASET[I] * XO + CSET[I],    I = 1, 2, ..., M,

where XO denotes the previous iterate $x^{(i)}$ and M is the number of partitions.

After each basic iteration, a norm of the vector XN - XO must be checked for 

convergence. If convergence has occurred or the maximum allowable number of itera- 
tions has been exceeded, the iteration halts; otherwise XO is replaced by XN, and the 
iteration step is repeated. 

This computation schema can be realized as follows. In the main program, the 
data objects and their types are declared. In addition, worker tasks, called X_TASKs, 
and their controlling task, the C_TASK, are defined. The body of the main program 
reads the input data, initiates the control task, which in turn initiates the worker tasks, 
and prints the solution vector after the control task has terminated. 

X_TASK[I] computes the components of the vector XN corresponding to XSET[I].
To accomplish this, X_TASK[I] needs the non-zero elements of A corresponding to
ASET[I], the elements of c corresponding to CSET[I], and a portion of the vector XO,
specifically those elements whose subscripts are the same as the column subscripts of the
non-zero elements of A in ASET[I]. Initially, this information is sent to each of the
X_TASKs by the C_TASK. After each iteration, if convergence has not occurred, the
vector XO is replaced with the vector XN. Because of this replacement, each X_TASK
must send those components of XN that it computes to the other X_TASKs that need
them. Each X_TASK then determines if the XN components just computed have
converged and notifies the C_TASK accordingly by sending a boolean flag. When an
X_TASK is notified by the C_TASK that all X_TASKs report convergence of their
components, it sends the current values of its XN components to the C_TASK and
terminates.

Figure I  Partitioning of a linear system

The role of the C_TASK is now clear. After initializing the X_TASKs and sending 
them the initial data they need to iterate, it receives local convergence information from 
each X_TASK, determines if global convergence has occurred, and notifies the X_TASKs 
accordingly. If global convergence has occurred, the C_TASK then receives the com- 
ponents of XN and terminates. 
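
The task structure described above can be sketched as a sequential simulation. This is only an illustration, not the implementation used in the paper: the names `x_task` and `c_task`, the dense list-of-lists storage, the equal-sized contiguous row partitions, and the max-norm test with tolerance `eps` are all assumptions made for the example, and the "tasks" run one after another rather than in parallel.

```python
# Sequential simulation of the partitioned iteration XN = A * XO + c.
# Assumptions (not from the paper): dense storage, contiguous equal-sized
# row partitions, and a component-wise convergence test with tolerance eps.

def x_task(A, c, xo, rows, eps):
    """Compute the XN components of one row partition and its local flag."""
    n = len(xo)
    xn_part = [sum(A[i][col] * xo[col] for col in range(n)) + c[i] for i in rows]
    converged = all(abs(new - xo[i]) <= eps for new, i in zip(xn_part, rows))
    return xn_part, converged


def c_task(A, c, M, eps=1e-10, max_iter=1000):
    """Drive M simulated X_TASKs until all report local convergence."""
    n = len(c)
    bounds = [(p * n // M, (p + 1) * n // M) for p in range(M)]  # [begin, end)
    xo = [0.0] * n
    for iteration in range(1, max_iter + 1):
        xn = [0.0] * n
        flags = []
        for begin, end in bounds:  # each pass stands in for one X_TASK
            part, flag = x_task(A, c, xo, range(begin, end), eps)
            xn[begin:end] = part
            flags.append(flag)
        xo = xn  # XO is replaced by XN
        if all(flags):  # global convergence is the AND of the local flags
            return xo, iteration
    return xo, max_iter


if __name__ == "__main__":
    # A small contractive example, so the iteration x = A x + c converges.
    A = [[0.0, 0.25, 0.0, 0.0],
         [0.25, 0.0, 0.25, 0.0],
         [0.0, 0.25, 0.0, 0.25],
         [0.0, 0.0, 0.25, 0.0]]
    c = [1.0, 1.0, 1.0, 1.0]
    solution, iterations = c_task(A, c, M=2)
    print(solution, iterations)
```

Because every simulated task reads the same XO, the number of partitions M changes only how the work is divided, not the iterates themselves.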

An Analytical Model of the Computation 

Objectives 

Because the intent of parallel computation is a reduction of the expected execution 
time, we must consider the performance of the parallel, sparse, linear systems iteration 
algorithm just described. Unlike sequential algorithms, the performance of a parallel 
algorithm depends not only on the number of arithmetic operations but also on the 
amount and frequency of intertask data transfer. Consequently, we derive formulae 
describing 

• the amount of data transfer among X_TASKs needed for each iteration, 

• the computational complexity of each X_TASK, and 

• the time to synchronize the X_TASKs. 

Based on these formulae, we create a model for predicting performance as a function of 
both the number and size of matrix partitions and the matrix sparsity. This performance
prediction model is then applied to both the general case of randomly sparse 
matrices and an important special case, the band matrix. 

Notation and Assumptions 

Unless otherwise specified, we assume the elements of the matrix A are randomly 
non-zero with probability P (i.e., $p(a_{ij} \neq 0) = P$). In our model of matrix sparsity, 
the probability function P is determined by imposing two very weak conditions on A. 
First, we require each row of A to contain at least Z non-zero elements, each randomly 
distributed throughout the row. Second, each row element not known to be one of the Z 
non-zero values is itself assumed to be non-zero with probability q. 

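Given the two conditions above, the value of P can be derived using a straightforward
application of conditional probabilities. We define two events: 

A: $a_{ij}$ is one of the Z non-zero elements in row i 

B: $a_{ij}$ is a non-zero element with probability q 

Then, since $p(A) = Z/N$, $p(B) = q$, and the two events are taken to be independent,

    $P(N, Z, q) = p(a_{ij} \neq 0)$
               $= p(A) + p(B) - p(A \text{ and } B)$
               $= \frac{Z}{N} + q - \frac{Z}{N} q$
               $= \frac{Z}{N}(1 - q) + q.$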


Finally, we require Z to be greater than zero. Otherwise, this sparsity model includes 
matrices containing one or more identically zero rows; the consequent singularity must 
be avoided. 
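
As a quick numerical check of the formula for $P(N, Z, q)$, the following sketch compares it with the fraction of non-zero entries in randomly generated row patterns. The function names, the sampling procedure, and the parameter values are illustrative assumptions, not part of the paper.

```python
import random


def nonzero_probability(N, Z, q):
    """P(N, Z, q) = (Z/N)(1 - q) + q, with Z > 0 as required in the text."""
    if not 0 < Z <= N:
        raise ValueError("Z must satisfy 0 < Z <= N")
    return (Z / N) * (1.0 - q) + q


def sample_row(N, Z, q, rng):
    """Draw one row pattern: Z guaranteed non-zero positions chosen at random,
    every other position non-zero with probability q."""
    known = set(rng.sample(range(N), Z))
    return [col in known or rng.random() < q for col in range(N)]


if __name__ == "__main__":
    rng = random.Random(0)
    N, Z, q = 200, 14, 0.01  # hypothetical values chosen for the example
    rows = [sample_row(N, Z, q, rng) for _ in range(2000)]
    empirical = sum(map(sum, rows)) / (N * len(rows))
    print(nonzero_probability(N, Z, q), empirical)
```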

Throughout our discussion, M denotes the number of partitions of A (i.e., the 
number of X_TASKs), and $b_j$ and $e_j$ respectively denote the indices of the beginning and 
ending rows of partition j. This notation, and that introduced throughout the 
remainder of our analysis, is summarized in Table I. 
Table I  Notation 

Quantity          Definition

$A$               arbitrary $N \times N$ sparse matrix
$b$               matrix semi-bandwidth
$b_j$             initial row of partition j
$c$               arbitrary constant N-vector
$C_b(M)$          transmission time for a boolean as a function of
                  the number of partitions
$C_p$             computation time for arithmetic operations
$C_s$             startup time for data transmission
$C_x(M)$          transmission time for one datum as a function of
                  the number of partitions
$e_j$             final row of partition j
$M$               number of matrix partitions
$N$               matrix dimension
$P(N, Z, q)$      probability that a matrix element is non-zero
$P_{Tr}(k, j)$    probability that partition k transfers data to
                  partition j
$q$               probability that a matrix element is non-zero given
                  that it is not known to be non-zero
$S$               fixed partition width
$t_j(comm)$       communication time for one iteration at partition j
$t_j(comp)$       computation time for one iteration at partition j
$Tr(k, j)$        expected data transfer from partition k to j
$x$               N-vector of unknowns
$Z$               number of known non-zero elements in a row





Data Transfer for Sparse Matrices 

Given a sparse matrix A whose elements are randomly non-zero with probability 
$P(N, Z, q)$ and two partitions j and k, we wish to determine the data transfer from
partition k to partition j needed to perform one iteration. For pedagogic purposes, we
consider three cases of increasing generality. 

Case I: j and k are single row partitions 

Partition j requires the single value $x_k$ if and only if $a_{jk} \neq 0$. Since this occurs 
with probability $P(N, Z, q)$, the expected data transfer from k to j is simply 
$P(N, Z, q)$. 

Case II: j is a multiple row partition; k is not 

Clearly, partition j does not need $x_{b_k}$ if and only if $a_{i b_k} = 0$ for all i in the range 
$b_j \le i \le e_j$. By assumption, each matrix element is randomly non-zero. Hence, 
the probability that at least one element of the column $b_k$ in partition j is non-zero 
is 

    $1 - [1 - P(N, Z, q)]^{e_j - b_j + 1}.$

Because partition k contains only one row, the expected data transfer from k to j is 
the same. 

Case III: both j and k are multiple row partitions 

This case is illustrated in Figure II. An immediate generalization of the previous 
case, partition j does not need any $x_l$ if and only if $a_{il} = 0$ for all i in the range 
$b_j \le i \le e_j$ and all l in the range $b_k \le l \le e_k$. Consequently, the expected data 
transfer from partition k to partition j is just $(e_k - b_k + 1)$ times that of case II, 
namely, 

    $Tr(k, j) = (e_k - b_k + 1) \left[ 1 - [1 - P(N, Z, q)]^{e_j - b_j + 1} \right].$    (3)

Finally, the probability, $P_{Tr}(k, j)$, that partition j needs at least one element from 
partition k is just the probability that the submatrix delimited by rows $b_j$ and $e_j$ 
and columns $b_k$ and $e_k$, matrix A in Figure II, is not identically zero. This
probability is just 

    $P_{Tr}(k, j) = 1 - [1 - P(N, Z, q)]^{(e_j - b_j + 1)(e_k - b_k + 1)}.$    (4)

Figure II  Data transfer between multiple row partitions 

Although general, (3) and (4) provide little insight or intuition about data transfer 
as a function of either $P(N, Z, q)$ or M. If the partition size is constant, simpler 
expressions can be obtained. Hence, we fix $(e_j - b_j + 1)$, the partition size, at a
constant $S = N/M$ for all partitions. Then replacing $P(N, Z, q)$ by its definition, we 
obtain 

    $Tr(k, j) = \frac{N}{M} \left[ 1 - \left[ (1 - q) \left( 1 - \frac{Z}{N} \right) \right]^{N/M} \right]$

and 

    $P_{Tr}(k, j) = 1 - \left[ (1 - q) \left( 1 - \frac{Z}{N} \right) \right]^{(N/M)^2}.$
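
To make the behavior of (3) and (4) concrete, the sketch below evaluates both for fixed-size partitions over a range of M. The function names and the parameter values (N = 1000, Z = 14, q = 0) are assumptions chosen for illustration only.

```python
def nonzero_probability(N, Z, q):
    """P(N, Z, q) = (Z/N)(1 - q) + q."""
    return (Z / N) * (1.0 - q) + q


def expected_transfer(P, rows_j, rows_k):
    """Equation (3): expected number of values partition k sends to partition j."""
    return rows_k * (1.0 - (1.0 - P) ** rows_j)


def transfer_probability(P, rows_j, rows_k):
    """Equation (4): probability that partition j needs anything from partition k."""
    return 1.0 - (1.0 - P) ** (rows_j * rows_k)


if __name__ == "__main__":
    N, Z, q = 1000, 14, 0.0  # hypothetical example values
    P = nonzero_probability(N, Z, q)
    for M in (5, 10, 50, 100, 500):
        S = N // M  # fixed partition size S = N/M
        print(M, S, expected_transfer(P, S, S), transfer_probability(P, S, S))
```

With one-row partitions both quantities reduce to P itself, which matches case I.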
Data Transfer for Band Matrices 

In addition to randomly sparse matrices, there are many other sparse matrices with 
discernible structure, notably band matrices. For a band matrix A with semi-bandwidth 
b, $a_{ij} \neq 0$ only if $|i - j| \le b$. 

Applying the sparsity model derived for the random sparsity case, the probability 
$P^{band}_{ij}(N, Z, q)$ that $a_{ij} \neq 0$ is given by 

    $P^{band}_{ij}(N, Z, q) = \begin{cases} P(2b, Z, q) & \text{if } |i - j| \le b \\ 0 & \text{otherwise.} \end{cases}$

Unlike the random sparsity case, not all elements of the band matrix are non-zero with 
equal probability. Hence, a direct substitution of $P^{band}_{ij}(N, Z, q)$ into (3) is inappropriate. 
Consider, however, a single column m of partition j where $b_k \le m \le e_k$. As with 
random sparsity, partition j does not need element $x_m$ if and only if column m is
identically zero. This occurs with probability 

    $\prod_{l = b_j}^{e_j} \left[ 1 - P^{band}_{lm}(N, Z, q) \right], \quad b_k \le m \le e_k.$

Hence, the probability that partition j needs $x_m$ is 

    $1 - \prod_{l = b_j}^{e_j} \left[ 1 - P^{band}_{lm}(N, Z, q) \right],$

and the expected data transfer from partition k to partition j is 

    $Tr(k, j) = \sum_{m = b_k}^{e_k} \left[ 1 - \prod_{l = b_j}^{e_j} \left[ 1 - P^{band}_{lm}(N, Z, q) \right] \right].$    (5)


Now consider column m, shown in Figure III. It can only contain non-zero elements if it 
lies between columns $b_j - b$ and $e_j + b$ inclusive. Otherwise, it would lie outside the 
intersection of the matrix band and partition j. Moreover, column m can only cause 
data transfer from partition k to partition j if it lies between columns $b_k$ and $e_k$ inclusive. 
Hence, the structure of the band matrix implies that 

    $\max\{b_k, b_j - b\} \le m \le \min\{e_k, e_j + b\}.$

Now consider the rows l associated with column m. By the definition of a band matrix, 
non-zero elements in column m must lie between rows $m - b$ and $m + b$ inclusive.
Moreover, the rows are constrained to lie within the partition j. Hence, 

    $\max\{b_j, m - b\} \le l \le \min\{e_j, m + b\}.$

Within these constraints on l and m, $P^{band}_{lm}(N, Z, q)$ is just $P(2b, Z, q)$. Hence, (5) reduces 
to 

    $Tr(k, j) = \sum_{m = s_l}^{s_u} \left[ 1 - \prod_{l = p_l}^{p_u} [1 - P(2b, Z, q)] \right]$    (6)

where 

    $s_l = \max\{b_k, b_j - b\},$
    $s_u = \min\{e_k, e_j + b\},$
    $p_l = \max\{b_j, m - b\},$

and 

    $p_u = \min\{e_j, m + b\}.$

Figure III  Band of non-zero elements intersecting partition j 

To obtain a closed form, we again reduce the problem to one of fixed size partitions S. 
Then the limits on (6) simplify to 

    $s_l = \max\{(k - 1)S + 1, (j - 1)S + 1 - b\},$
    $s_u = \min\{kS, jS + b\},$
    $p_l = \max\{(j - 1)S + 1, m - b\},$

and 

    $p_u = \min\{jS, m + b\}.$

Further simplification of this summation unfortunately requires enumeration of several 
cases. These cases are a function of the relationships among the matrix bandwidth, the 
partition size, and the relative positions in the matrix of the partitions j and k. Fortui- 
tously, those cases where j > k are symmetric with j < k. Hence, we consider only the 
case j < k. Derivation of the remaining cases is still a lengthy endeavor, providing little 
insight. Consequently, we simply describe the cases, using Figure IV, and enumerate the 
results. 

Case I: $(k - 1)S + 1 > jS + b$

Here, the submatrix determining possible data transfer from partition k to partition 
j lies outside the matrix band. Consequently, the submatrix is identically zero and 
no data transfer occurs. This case arises if 

    $k - j > \frac{b - 1}{S} + 1.$

Figure IV  Data transfer cases for band matrix with fixed partitions (part I) 

For the remainder of the cases, we implicitly assume that some data transfer occurs 
(i.e., $(k - 1)S + 1 \le jS + b$). 

Case II: $b \le S$

In this case, the partition width is at least as large as the semi-bandwidth (half the
bandwidth). Two subcases, based on possible positions of k, arise. 

Subcase IIa: $k > j + 1$

This condition, coupled with that of case II, places the determining submatrix
outside the band, and no data transfer occurs. 

Subcase IIb: $k = j + 1$

Partitions j and k are adjacent. Moreover, partition k is the only partition
transferring data to j such that k > j. The expected data transfer is 

    $Tr(k, j) = b + \frac{[1 - P(2b, Z, q)] \left[ [1 - P(2b, Z, q)]^{b} - 1 \right]}{P(2b, Z, q)}.$    (7)

Similarly, only partition j - 1 transfers data to j from the other side. Hence, if 
$b \le S$, only adjacent partitions must exchange data. This suggests that this
partition size for band matrices might be well suited to a ring architecture. 

The probability that partition k must transfer data to partition j is again just the 
probability that the submatrix is not identically zero, or 

    $P_{Tr}(k, j) = 1 - \prod_{m = jS + 1}^{jS + b} \; \prod_{h = m - b}^{jS} [1 - P(2b, Z, q)] = 1 - [1 - P(2b, Z, q)]^{b(b + 1)/2}.$

Case III: $b > S$

The converse of case II, the partition size is less than half the matrix bandwidth. 
As before, subcases based on the possible positions of partition k arise. In the
expressions below, $P$ abbreviates $P(2b, Z, q)$.

Subcase IIIa: $kS \le (j - 1)S + b$

Here, the determining submatrix lies completely within the band, and the expected 
data transfer is 

    $Tr(k, j) = S \left[ 1 - [1 - P]^{S} \right].$

Figure IV  Data transfer cases for band matrix with fixed partitions (part II) 

Subcase IIIb: $(j - 1)S + b + 1 \le kS \le jS + b$

The determining submatrix lies partially within the matrix band, and every column 
also lies partially within the band. If T denotes the position within partition j where
the last column of the determining submatrix intersects the edge of the band (with the
limits of (6), $T = kS - b - (j - 1)S$), the expected data transfer from partition k to j is 

    $Tr(k, j) = (S - T) \left[ 1 - [1 - P]^{S} \right] + T + \frac{[1 - P]^{S - T + 1} \left[ [1 - P]^{T} - 1 \right]}{P}.$

Subcase IIIc: $kS > jS + b$

Finally, the determining submatrix can lie partially within the band with some 
columns entirely outside the band. This leads to an expected data transfer of 

    $Tr(k, j) = (j - k + 1)S + b + \frac{1}{P} \left[ [1 - P]^{(j - k + 1)S + b + 1} - [1 - P] \right].$
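
The case analysis for fixed-size partitions can be cross-checked against a direct evaluation of the sum (6). The sketch below is a hedged illustration: the function names are hypothetical, `P` stands for $P(2b, Z, q)$ and must be positive, the closed forms are written as they follow from (6) for j < k, and the quantity `T` in subcase IIIb is assumed to be $kS - b - (j - 1)S$.

```python
def band_transfer_sum(j, k, S, b, P):
    """Evaluate the sum (6) term by term for fixed partition size S (1-based j < k)."""
    s_l = max((k - 1) * S + 1, (j - 1) * S + 1 - b)
    s_u = min(k * S, j * S + b)
    total = 0.0
    for m in range(s_l, s_u + 1):
        p_l = max((j - 1) * S + 1, m - b)
        p_u = min(j * S, m + b)
        rows = max(0, p_u - p_l + 1)  # rows of partition j inside the band in column m
        total += 1.0 - (1.0 - P) ** rows
    return total


def geometric_part(P, n):
    """Sum of 1 - (1 - P)^i for i = 1..n, i.e. n + (1 - P)[(1 - P)^n - 1] / P."""
    return n + (1.0 - P) * ((1.0 - P) ** n - 1.0) / P


def band_transfer_closed(j, k, S, b, P):
    """Closed forms for j < k, selected by the cases described in the text."""
    if (k - 1) * S + 1 > j * S + b:      # Case I: submatrix outside the band
        return 0.0
    if b <= S:                           # Case II: only k = j + 1 transfers data
        return geometric_part(P, b) if k == j + 1 else 0.0
    if k * S <= (j - 1) * S + b:         # Subcase IIIa: entirely inside the band
        return S * (1.0 - (1.0 - P) ** S)
    if k * S <= j * S + b:               # Subcase IIIb (T is an assumed definition)
        T = k * S - b - (j - 1) * S
        full = (S - T) * (1.0 - (1.0 - P) ** S)
        partial = T + (1.0 - P) ** (S - T + 1) * ((1.0 - P) ** T - 1.0) / P
        return full + partial
    return geometric_part(P, (j - k + 1) * S + b)  # Subcase IIIc


if __name__ == "__main__":
    S, b, P = 4, 10, 0.3  # hypothetical values with b > S (Case III)
    for k in range(3, 8):
        print(k, band_transfer_sum(2, k, S, b, P), band_transfer_closed(2, k, S, b, P))
```

For each k, the two printed values should agree up to rounding, which gives a simple consistency check on the case boundaries.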


Parallel Computational Complexity 

As noted earlier, the performance of a parallel algorithm depends on both the inter- 
task data transfer and the amount of computation performed by each task. Having con- 
sidered the former, we turn our attention to the latter. 

Each of the parallel X_TASKs is itself just a sequential code whose two primary 
constituents, inner product and convergence test, were described earlier. Consequently, 
we can apply standard techniques [4] to determine the complexity of each X_TASK. 
The results of this analysis are shown in Table II. 

Table II  Computational Complexity of X_TASKs 

Loop   Statement   Statement
       Cost

(1)    $2C_p$      FOR I ← B[J] TO E[J] DO
                   BEGIN
       $C_p$         SUM ← 0;
(2)    $2C_p$        FOR K ← L[J] TO U[J] DO
       $6C_p$          SUM ← SUM + ANZ[K] * XO[COLSUB[K]];†
       $4C_p$        XN[I] ← SUM + C[I]
                   END;
       $C_p$       CONVERGED ← TRUE;
(3)    $2C_p$      FOR I ← B[J] TO E[J] DO
                   BEGIN
       $5C_p$        IF ABS(XN[I] - XO[I]) > EPS THEN
       $C_p$           CONVERGED ← FALSE;
       $3C_p$        XO[I] ← XN[I]
                   END;

Loop Cost

(1): $(e_j - b_j + 1) C_p [6 (N \text{ or } 2b) P + 7] + 2C_p$
(2): $6 C_p (N \text{ or } 2b) P + 2C_p$
(3): $(e_j - b_j + 1) 9C_p + 3C_p$

Here $P$ is $P(N, Z, q)$ for the random matrix and $P(2b, Z, q)$ for the band matrix.

† ANZ and COLSUB are vectors of the non-zero elements of A and the 
corresponding column subscripts, respectively. L[J] and U[J] denote the
beginning and ending indices of components of these vectors belonging to partition J. 

We assume that all indexing and arithmetic operations require the same amount of 
time $C_p$. Combining the results for the inner product and convergence test, the
computational complexity of an arbitrary X_TASK is 

    $(e_j - b_j + 1) C_p \left( 6 N P(N, Z, q) + 16 \right) + 5C_p$    (8)

for the random matrix and 

    $(e_j - b_j + 1) C_p \left( 12 b P(2b, Z, q) + 16 \right) + 5C_p$

for the band matrix. The C_TASK must also check for global convergence after each 
iteration. This consists of ANDing the M local convergence flags received from the 
X_TASKs and requires 

    $(3M + 1) C_p$

operations. 
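
The loop costs of Table II can be added up mechanically to confirm the totals. In this sketch the function names are hypothetical, `row_width` stands for N (random matrix) or 2b (band matrix), and `P` is the corresponding non-zero probability.

```python
def x_task_time(rows, row_width, P, C_p):
    """Time for one X_TASK iteration, summed loop by loop as in Table II.

    rows      -- number of rows in the partition, e_j - b_j + 1
    row_width -- N for the random matrix, 2b for the band matrix
    P         -- probability that an element within row_width is non-zero
    """
    inner_loop = 6.0 * row_width * P + 2.0            # loop (2)
    product_loop = rows * (inner_loop + 5.0) + 2.0    # loop (1): SUM <- 0 and XN[I] add 1 + 4
    convergence_loop = rows * 9.0 + 3.0               # loop (3), including CONVERGED <- TRUE
    return C_p * (product_loop + convergence_loop)


def c_task_time(M, C_p):
    """Time for the C_TASK to AND the M local convergence flags."""
    return (3 * M + 1) * C_p


if __name__ == "__main__":
    rows, N, P, C_p = 10, 100, 0.14, 1.0  # hypothetical example values
    closed_form = rows * C_p * (6.0 * N * P + 16.0) + 5.0 * C_p
    print(x_task_time(rows, N, P, C_p), closed_form, c_task_time(5, C_p))
```

The loop-by-loop total and the closed form printed side by side should coincide.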

Model Description 

Having just determined the expected amount of data transfer among X_TASKs 
(partitions), and their computational complexity, we can now define an execution time 
model of the parallel, sparse matrix algorithm. This model can then be used to predict 
the execution time of one iteration. 

Let $t_j(comp)$ denote the computational complexity of X_TASK j, $t_j(comm)$ denote 
the time required for task j to send and receive all data needed for the next iteration, 
and $t(sync)$ be the time required for the C_TASK to receive and test all local
synchronization flags. Then the total execution time for one sparse matrix iteration is 

    $t(sync) + \max_{1 \le j \le M} \left[ t_j(comp) + t_j(comm) \right].$    (9)

Clearly, the time required to transmit or receive a datum is some function of the 
number of partitions (X_TASKs) concurrently operating (e.g., if only two X_TASKs 
were operating in parallel, they should be able to exchange data more quickly than if 
fifty additional X_TASKs were also operating). Hence, we make both the time needed 
to transmit a boolean, $C_b(M)$, and the time to transmit an x value, $C_x(M)$, functions of 
M. 

We now consider each component of the execution time. Given that $C_b(M)$ denotes 
the time needed to transmit a single boolean value, then $t(sync)$ is given by 

    $t(sync) = \underbrace{M C_b(M)}_{\text{receive flags}} + \underbrace{(3M + 1) C_p}_{\text{test flags}} + \underbrace{M C_b(M)}_{\text{send flags}}.$

Of course, $t_j(comp)$ is given by (8). The communication component, $t_j(comm)$, is,
however, somewhat more complicated. In addition to including the interpartition data 
transfer, it should also include startup costs for data transmission. That is, two
partitions exchanging ten data values should require less time than four partitions exchanging 
five data values. This intent is reflected by the formula 

    $t_j(comm) = \text{send to other partitions} + \text{receive from other partitions}$
              $= \sum_{k = 1, k \neq j}^{M} \left[ C_s P_{Tr}(k, j) + C_x(M) Tr(k, j) \right] + \sum_{k = 1, k \neq j}^{M} C_x(M) Tr(k, j)$    (10)

where $C_s$ is the startup cost for initiating a data transfer. 
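
The model in (9) and (10) translates directly into a few functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, `C_b` and `C_x` are passed as functions of M, and `Tr`, `PTr`, and `comp` are caller-supplied callables standing in for $Tr(k, j)$, $P_{Tr}(k, j)$, and $t_j(comp)$.

```python
def sync_time(M, C_p, C_b):
    """t(sync): receive M flags, test them, and send M flags back."""
    return M * C_b(M) + (3 * M + 1) * C_p + M * C_b(M)


def comm_time(j, M, C_s, C_x, Tr, PTr):
    """Equation (10) for partition j; Tr(k, j) and PTr(k, j) are callables."""
    others = [k for k in range(1, M + 1) if k != j]
    send = sum(C_s * PTr(k, j) + C_x(M) * Tr(k, j) for k in others)
    receive = sum(C_x(M) * Tr(k, j) for k in others)
    return send + receive


def iteration_time(M, C_p, C_s, C_b, C_x, comp, Tr, PTr):
    """Equation (9): t(sync) plus the slowest partition's compute and transfer time."""
    slowest = max(comp(j) + comm_time(j, M, C_s, C_x, Tr, PTr) for j in range(1, M + 1))
    return sync_time(M, C_p, C_b) + slowest


if __name__ == "__main__":
    constant = lambda M: 1.0  # hypothetical cost function, independent of M
    always = lambda k, j: 1.0  # every pair of partitions exchanges data
    # Same total volume per partition (20 values sent), different partner counts.
    few = comm_time(1, 3, 10.0, constant, lambda k, j: 10.0, always)
    many = comm_time(1, 5, 10.0, constant, lambda k, j: 5.0, always)
    print(few, many)
```

In the usage example, the volume sent by partition 1 is the same in both calls, so the difference between the two results comes only from the startup cost $C_s$ paid once per partner.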


Given these formulae, consider the two matrix cases for which we derived closed 
forms for $P_{Tr}(k, j)$ and $Tr(k, j)$, the randomly sparse matrix and the band matrix, both 
with fixed partition size. 

Randomly Sparse Matrix 

Substituting values in (10) for $P_{Tr}(k, j)$ and $Tr(k, j)$ gives 

    $t_j^{random}(comm) = C_s (M - 1) \left[ 1 - \left[ (1 - q) \left( 1 - \frac{Z}{N} \right) \right]^{(N/M)^2} \right]$
                       $+ \frac{2 C_x(M) N (M - 1)}{M} \left[ 1 - \left[ (1 - q) \left( 1 - \frac{Z}{N} \right) \right]^{N/M} \right].$
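Band Matrix 

For the band matrix, case IIb, only the two adjacent partitions contribute to the send
and receive sums in (10), and we have 

    $t_j^{band}(comm) = 2 C_s \left[ 1 - \left[ (1 - q) \left( 1 - \frac{Z}{2b} \right) \right]^{b(b + 1)/2} \right]$
                     $+ 4 C_x(M) \left[ b + \frac{(1 - q) \left( 1 - \frac{Z}{2b} \right) \left[ \left[ (1 - q) \left( 1 - \frac{Z}{2b} \right) \right]^{b} - 1 \right]}{\frac{Z}{2b}(1 - q) + q} \right].$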


Conclusions Based on the Model 

As we have seen, the total execution time for one sparse matrix iteration is given 
by (9). For equal sized partitions, (9) simplifies to 

    $t(sync) + t_j(comp) + t_j(comm).$    (11)

There are two primary means of implementing communication in a parallel system, 
shared memory and communication networks. In both cases, the delays incurred for 
data transfer increase as the number of parallel tasks increases. (Shared memory suffers 
from memory access conflicts, and communication networks, being necessarily incomplete 
connections, require additional routing of data.) Hence, it seems appropriate to make the 
synchronization and data transmission costs functions of the number of partitions M 
(i.e., the number of parallel X_TASKs). We used the functions 

    $f(M) = 1, \quad \log_2(M), \quad \sqrt{M}, \quad M$

in the communication component of (11) to reflect the possible range of communication 
costs one might encounter in a complete connection, tree, square mesh, and ring,
respectively. 

Using (11) and the communication cost function, $f(M)$, we then plotted total
execution time as a function of matrix sparsity, $P(N, Z, q)$, computation time, $C_p$,
communication time, $C_s$ and $C_x$, and synchronization cost, $C_b$, for the random sparsity case. These 
plots, shown in Figures V-VII, are discussed in detail below. In all cases, the smallest 
number of partitions chosen was M = 5. 
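
A sweep of this kind can be reproduced in outline as follows. This is a sketch, not the computation behind the figures: the cost constants are hypothetical, and the sketch assumes that the cost function scales both transmission times, $C_b(M) = C_b f(M)$ and $C_x(M) = C_x f(M)$, which the text does not state explicitly.

```python
import math

# Candidate cost functions f(M) for the four interconnection schemes.
COST_FUNCTIONS = {
    "complete": lambda M: 1.0,
    "tree": lambda M: math.log2(M),
    "mesh": lambda M: math.sqrt(M),
    "ring": lambda M: float(M),
}


def iteration_time(N, Z, q, M, f, C_p=1.0, C_s=10.0, C_x=2.0, C_b=1.0):
    """Equation (11) for a randomly sparse matrix with partition size S = N/M.

    All cost constants are hypothetical; C_b and C_x are scaled by f(M).
    """
    S = N / M
    one_minus_P = (1.0 - q) * (1.0 - Z / N)
    P = 1.0 - one_minus_P
    t_sync = 2.0 * M * C_b * f(M) + (3 * M + 1) * C_p
    t_comp = S * C_p * (6.0 * N * P + 16.0) + 5.0 * C_p
    t_comm = (C_s * (M - 1) * (1.0 - one_minus_P ** (S * S))
              + 2.0 * C_x * f(M) * (M - 1) * S * (1.0 - one_minus_P ** S))
    return t_sync + t_comp + t_comm


def best_partition_count(N, Z, q, f, candidates):
    """Return the candidate M with the smallest predicted iteration time."""
    return min(candidates, key=lambda M: iteration_time(N, Z, q, M, f))


if __name__ == "__main__":
    N, Z, q = 1000, 14, 0.0  # 14 non-zero elements per row, as in Figure V
    candidates = [5, 8, 10, 20, 25, 40, 50, 100, 125, 200, 250, 500, 1000]
    for name, f in COST_FUNCTIONS.items():
        print(name, best_partition_count(N, Z, q, f, candidates))
```

The absolute numbers depend entirely on the assumed constants; the sketch is meant only to show how an optimal M emerges from the trade-off between computation, communication, and synchronization.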

Figure V 

This figure shows iteration time as a function of the number of matrix partitions 
(X_TASKs) for varying communication costs. Each matrix row contains 14 non-zero ele- 
ments, a typical number for a matrix arising from a finite element method. 

As can be seen, there exists an optimal level of parallelism in each case. Not 
surprisingly, the optimum level of parallelism declines as the communication costs 
increase. Even the complete connection cannot support as many parallel tasks as there 
are matrix rows. The reason is quite simple: as the number of partitions grows,
synchronization costs become prohibitive. 

Figure VI 

This figure shows the effect of matrix sparsity on iteration time for communication 
costs proportional to $\sqrt{M}$; the lowest curve corresponds to greatest sparsity. As 

expected, increasing the number of non-zero elements results in increased iteration time. 
In addition, the optimum level of parallelism increases as the number of non-zero ele- 
ments increases. 

Figure VII 

Finally, this figure shows iteration time for varying matrix sizes, again with
communication costs proportional to $\sqrt{M}$. 

Band Matrices 

The execution time model for banded matrices, case II, is easier to analyze. We 
have seen that intertask data transfer occurs only between adjacent tasks if the width of 
a partition is at least as large as the matrix semi-bandwidth. If this condition is met, 
the optimum number of partitions (X_TASKs) depends on the relative costs of computa- 
tion and communication. 

Summary 

As we have seen, the performance of a parallel algorithm depends not only on the 
amount of computation performed by each task but also on the amount and frequency 
of intertask data transfer and task synchronization. 

For a parallel implementation of iterative methods for solving sparse linear systems 
of equations, we have derived the expected intertask data transfer and defined an execu- 
tion time model that can be used to predict iteration time. We have applied the model 
to both the general case of randomly sparse matrices and one important special case, 
banded matrices. 

Results of the model clearly show that the execution time of the solution methods 
can be reduced by partitioning the computation into parallel subtasks. However, the 
optimum number of partitions is very dependent on synchronization and communication 
costs. 